{
 "cells": [
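  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Problem 2: Implementing a Gaussian Naive Bayes Classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tasks:\n",
    "1. Implement a Gaussian naive Bayes classifier\n",
    "2. Compute the model's precision, recall, and F1 score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will implement a naive Bayes classifier for continuous features, assuming that each feature follows a Gaussian distribution within each class."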
   ]
  },
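  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notation\n",
    "\n",
    "Given a training set $T$\n",
    "\n",
    "$$T = \\{(x_1, y_1), (x_2, y_2), \\cdots, (x_N, y_N)\\}$$\n",
    "\n",
    "where $x$ is a sample's feature vector and $y$ is its label. A subscript indexes the sample and a superscript indexes the feature, so $x_i^{(j)}$ is the $j$-th feature of the $i$-th sample. The training set contains $\\vert T \\vert = N$ samples in total.\n",
    "\n",
    "We assume a $K$-class classification task and denote the class labels by $c_1, c_2, \\ldots, c_K$."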
   ]
  },
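  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Goal\n",
    "\n",
    "Our goal is to classify samples. We take a probabilistic approach: among $P(Y = c_k \\mid X = x), \\ k = 1, 2, \\ldots, K$, we find the $k$ with the largest probability, i.e. the class the model considers most probable for the given sample $x$."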
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Principle\n",
    "\n",
    "By Bayes' theorem:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "    P(Y = c_k \\mid X = x) &= \\frac{P(Y = c_k, X = x)}{P(X = x)} \\\\\n",
    "                  &= \\frac{P(X = x \\mid Y = c_k)P(Y = c_k)}{\\sum_{k'} P(X = x \\mid Y = c_{k'})P(Y = c_{k'})} \\\\\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "We want the largest of these $K$ probabilities. Since all $K$ of them share the same denominator, we can ignore it and compare only the numerators, i.e. the product of the **prior** $P(Y = c_k)$ and the **likelihood** $P(X = x \\mid Y = c_k)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the prior distribution\n",
    "\n",
    "$$\n",
    "P(Y = c_k), \\ k = 1, 2, \\ldots, K\n",
    "$$\n",
    "\n",
    "and the conditional distribution\n",
    "\n",
    "$$\n",
    "P(X = x \\mid Y = c_k) = P(X^{(1)} = x^{(1)}, \\cdots, X^{(n)} = x^{(n)} \\mid Y = c_k), \\ k = 1, 2, \\ldots, K\n",
    "$$\n",
    "\n",
    "we obtain the joint distribution $P(X = x, Y = c_k)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**So the problem reduces to: how do we compute the prior and the likelihood?**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Prior $P(Y = c_k)$\n",
    "\n",
    "The prior is easy to estimate: it is the relative frequency of class $c_k$ in the training set.\n",
    "\n",
    "$$\n",
    "P(Y = c_k) = \\frac{\\text{number of samples with label } c_k}{N}\n",
    "$$"
   ]
  },
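  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Likelihood $P(X = x \\mid Y = c_k)$\n",
    "\n",
    "Computing this conditional probability directly is hard, so **we assume that the features are independent of each other given the class**, which gives\n",
    "\n",
    "$$P(X = x \\mid Y = c_k) = \\prod^n_{j=1}P(X^{(j)}=x^{(j)} \\mid Y = c_k)$$\n",
    "\n",
    "where $x^{(j)}$ is the $j$-th feature of sample $x$ and $n$ is the number of features.\n",
    "\n",
    "The complicated joint conditional probability thus becomes a product of per-feature conditional probabilities."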
   ]
  },
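  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Conditional probability of feature $j$, $P(X^{(j)}=x^{(j)} \\mid Y = c_k)$\n",
    "\n",
    "All of our features are continuous, so we assume that each feature follows a normal distribution within each class.\n",
    "\n",
    "Given $Y = c_k$, the likelihood of $X^{(j)} = a_{jl}$, where $a_{jl}$ is a value taken by feature $j$, is given by the Gaussian density (strictly a probability density rather than a probability, since the feature is continuous):\n",
    "\n",
    "$$\n",
    "P(X^{(j)} = a_{jl} \\mid Y = c_k) = \\frac{1}{\\sqrt{2 \\pi \\sigma^2_{c_k,j}}} \\exp{\\bigg( - \\frac{(a_{jl} - \\mu_{c_k,j})^2}{2 \\sigma^2_{c_k,j}} \\bigg)}\n",
    "$$\n",
    "\n",
    "Here $\\mu_{c_k,j}$ and $\\sigma^2_{c_k,j}$ are the mean and variance of feature $j$ given $Y = c_k$. **Both are estimated from the training samples.**\n",
    "\n",
    "A normal distribution is fully determined by two parameters (mean and variance). For feature $j$ we estimate a mean and a variance for each of the $K$ classes, so feature $j$ has $2K$ parameters in total."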
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In summary\n",
    "\n",
    "The naive Bayes classifier can be written as:\n",
    "\n",
    "$$\n",
    "y = \\mathop{\\arg\\max}_{c_k} P(Y = c_k) \\prod_j P(X^{(j)} = x^{(j)} \\mid Y = c_k)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementation\n",
    "The implementation runs into a numerical issue: multiplying many small conditional probabilities together can underflow to zero. The fix is to rewrite the product as a sum.\n",
    "\n",
    "Recall that our goal is:\n",
    "\n",
    "$$\n",
    "y = \\mathop{\\arg\\max}_{c_k} P(Y = c_k) \\prod_j P(X^{(j)} = x^{(j)} \\mid Y = c_k)\n",
    "$$\n",
    "\n",
    "that is, to compare these $K$ values and take the $k$ corresponding to the largest one.\n",
    "\n",
    "To avoid underflow we take the logarithm of the expression above. The logarithm is strictly increasing, so applying it to all $K$ values does not change their order.\n",
    "\n",
    "The goal then becomes:\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "y &= \\mathop{\\arg\\max}_{c_k} \\log \\Big[ P(Y = c_k) \\prod_j P(X^{(j)} = x^{(j)} \\mid Y = c_k) \\Big] \\\\\n",
    "&= \\mathop{\\arg\\max}_{c_k} \\Big[ \\log P(Y = c_k) + \\sum_j \\log P(X^{(j)} = x^{(j)} \\mid Y = c_k) \\Big]\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "The conditional probability is transformed in the same way, substituting $a_{jl} = x^{(j)}$ into the Gaussian density:\n",
    "\n",
    "$$\\begin{aligned}\n",
    "\\log P(X^{(j)} = x^{(j)} \\mid Y = c_k) &= \\log \\bigg[\\frac{1}{\\sqrt{2 \\pi \\sigma^2_{c_k,j}}} \\exp{\\bigg(- \\frac{(x^{(j)} - \\mu_{c_k,j})^2}{2 \\sigma^2_{c_k,j}}\\bigg)}\\bigg]\\\\\n",
    "&= \\log \\frac{1}{\\sqrt{2 \\pi \\sigma^2_{c_k,j}}} + \\log \\exp{\\bigg(- \\frac{(x^{(j)} - \\mu_{c_k,j})^2}{2 \\sigma^2_{c_k,j}}\\bigg)}\\\\\n",
    "&= - \\frac{1}{2} \\log \\big( 2 \\pi \\sigma^2_{c_k,j} \\big) - \\frac{1}{2} \\frac{(x^{(j)} - \\mu_{c_k,j})^2}{\\sigma^2_{c_k,j}}\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "So Gaussian naive Bayes can be rewritten as:\n",
    "\n",
    "$$\n",
    "y = \\mathop{\\arg\\max}_{c_k} \\bigg[ \\log P(Y = c_k) + \\sum_j \\Big( - \\frac{1}{2} \\log \\big( 2 \\pi \\sigma^2_{c_k,j} \\big) - \\frac{1}{2} \\frac{(x^{(j)} - \\mu_{c_k,j})^2}{\\sigma^2_{c_k,j}} \\Big) \\bigg]\n",
    "$$\n",
    "\n",
    "This is the quantity we need: we compute these $K$ values and take the $k$ corresponding to the largest one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Load the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "spambase = np.loadtxt('data/spambase/spambase.data', delimiter = \",\")\n",
    "spamx = spambase[:, :57]\n",
    "spamy = spambase[:, 57]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Split the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "trainX, testX, trainY, testY = train_test_split(spamx, spamy, test_size = 0.4, random_state = 32)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((2760, 57), (2760,), (1841, 57), (1841,))"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainX.shape, trainY.shape, testX.shape, testY.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Implement Gaussian naive Bayes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Naive Bayes is simple to implement, but first you need to be comfortable with a few techniques."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Python dict\n",
    "\n",
    "Given a Python dictionary `a`, the statement `a[key] = value` adds the key-value pair `{key: value}` to `a`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the dictionary test_dict\n",
    "test_dict = dict()\n",
    "\n",
    "# Store {'a': 1} in test_dict\n",
    "# YOUR CODE HERE\n",
    "test_dict['a'] = 1\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    }
   ],
   "source": [
    "# Test case\n",
    "print(test_dict['a']) # 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. np.mean\n",
    "\n",
    "Computes the mean. Use `axis = 0` to average over each column and `axis = 1` to average over each row. Pass `keepdims = True` to keep the result's number of dimensions the same as the input's."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1813, 57)\n"
     ]
    }
   ],
   "source": [
    "# Select the rows of spamx whose spamy label is 1\n",
    "test_matrix = spamx[spamy == 1, :]\n",
    "print(test_matrix.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the mean of each column of test_matrix and store it in test_mean\n",
    "# YOUR CODE HERE\n",
    "test_mean = np.mean(test_matrix,axis = 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "595.0620441257583\n",
      "(57,)\n"
     ]
    }
   ],
   "source": [
    "# Test case\n",
    "print(test_mean.sum()) # 595.062044126\n",
    "print(test_mean.shape) # (57,), since keepdims is not set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. np.var\n",
    "\n",
    "Computes the variance. Use `axis = 0` to take the variance of each column and `axis = 1` to take the variance of each row. Pass `keepdims = True` to keep the result's number of dimensions the same as the input's."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the variance of each column of test_matrix and store it in test_var\n",
    "# YOUR CODE HERE\n",
    "test_var = np.var(test_matrix,axis = 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "772407.5060043654\n",
      "(57,)\n"
     ]
    }
   ],
   "source": [
    "# Test case\n",
    "print(test_var.sum()) # 772407.506004\n",
    "print(test_var.shape) # (57,), since keepdims is not set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Store test_mean and test_var in the dictionary test_dict\n",
    "\n",
    "Store `{'mean': test_mean}` and `{'var': test_var}` in `test_dict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# YOUR CODE HERE\n",
    "test_dict['mean'] = test_mean\n",
    "test_dict['var'] = test_var\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['a', 'mean', 'var'])\n"
     ]
    }
   ],
   "source": [
    "# Test case\n",
    "print(test_dict.keys())  # dict_keys(['a', 'mean', 'var'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. NumPy indexing\n",
    "\n",
    "At prediction time we rely on a small NumPy indexing trick.\n",
    "\n",
    "Given a list of three strings `'a', 'b', 'c'`, each standing for a class, and an index array `np.array([1, 2, 0, 1, 0])`, the code below shows how NumPy can look up several entries at once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['a' 'b' 'c']\n",
      "[1 2 0 1 0]\n",
      "['b' 'c' 'a' 'b' 'a']\n",
      "<class 'numpy.ndarray'>\n"
     ]
    }
   ],
   "source": [
    "labels = ['a', 'b', 'c']\n",
    "\n",
    "# First convert labels to an np.ndarray\n",
    "np_labels = np.array(labels)\n",
    "\n",
    "print(np_labels)\n",
    "\n",
    "# Build the index array\n",
    "index = np.array([1, 2, 0, 1, 0])\n",
    "\n",
    "print(index)\n",
    "\n",
    "# Use index to look up entries of np_labels\n",
    "results = np_labels[index]\n",
    "\n",
    "print(results)\n",
    "print(type(results))"
   ]
  },
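  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As shown, an `np.ndarray` of indices retrieves several values at once, and the result is itself an `np.ndarray`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. np.argmax\n",
    "\n",
    "Returns the index of the maximum value. The `axis` argument controls whether the maximum is taken over each column (`axis = 0`) or over each row (`axis = 1`)."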
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.9 0.1]\n",
      " [0.4 0.6]\n",
      " [0.1 0.9]]\n"
     ]
    }
   ],
   "source": [
    "test_array = np.array([[0.9, 0.1],\n",
    "                       [0.4, 0.6],\n",
    "                       [0.1, 0.9]])\n",
    "\n",
    "print(test_array)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 2], dtype=int64)"
      ]
     },
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.argmax(test_array, axis = 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As shown, `axis = 0` returns the row index of the maximum in each column, here 0 and 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the index of the maximum in each row\n",
    "# YOUR CODE HERE\n",
    "test_argmax = np.argmax(test_array, axis = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0 1 1]\n"
     ]
    }
   ],
   "source": [
    "# Test case\n",
    "print(test_argmax) # [0 1 1]"
   ]
  },
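  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Now we implement Gaussian naive Bayes**, as a class. Training only has to estimate each class's prior, means, and variances, so `fit` is cheap; most of the computation happens at prediction time."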
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {},
   "outputs": [],
   "source": [
    "class myGaussianNB:\n",
    "    '''\n",
    "    Gaussian naive Bayes for continuous features\n",
    "    '''\n",
    "    def __init__(self):\n",
    "        '''\n",
    "        Initialize four dictionaries\n",
    "        self.label_mapping     class label -> index (int)\n",
    "        self.probability_of_y  class label -> prior probability (float)\n",
    "        self.mean              class label -> per-feature means (np.ndarray)\n",
    "        self.var               class label -> per-feature variances (np.ndarray)\n",
    "        '''\n",
    "        self.label_mapping = dict()\n",
    "        self.probability_of_y = dict()\n",
    "        self.mean = dict()\n",
    "        self.var = dict()\n",
    "        \n",
    "    def _clear(self):\n",
    "        '''\n",
    "        An instance may call fit more than once, so before each fit we discard the previously learned parameters\n",
    "        '''\n",
    "        self.label_mapping.clear()\n",
    "        self.probability_of_y.clear()\n",
    "        self.mean.clear()\n",
    "        self.var.clear()\n",
    "    \n",
    "    def fit(self, trainX, trainY):\n",
    "        '''\n",
    "        For each class label in trainY, compute the class prior and the mean and variance of every feature over that class's training samples\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "            trainX: np.ndarray, training features, shape: (n_samples, n_features)\n",
    "        \n",
    "            trainY: np.ndarray, training labels, shape: (n_samples, )\n",
    "        '''\n",
    "        \n",
    "        # Reset any previously learned parameters\n",
    "        self._clear()\n",
    "        \n",
    "        # Collect the distinct class labels\n",
    "        labels = np.unique(trainY)\n",
    "        print(trainX.shape)\n",
    "        # Map each class label to a column index\n",
    "        self.label_mapping = {label: index for index, label in enumerate(labels)}\n",
    "        \n",
    "        # Loop over the classes\n",
    "        for label in labels:\n",
    "            \n",
    "            # Select all training samples of class label and store them as x\n",
    "            x = trainX[trainY == label, :]\n",
    "            \n",
    "            # Prior: number of samples in x divided by the total number of training samples, stored in self.probability_of_y under the key label\n",
    "            # YOUR CODE HERE\n",
    "            self.probability_of_y[label] = x.shape[0] / trainX.shape[0]\n",
    "            \n",
    "            # Column-wise mean of x with keepdims = True, stored in self.mean under the key label as a 2-D np.ndarray\n",
    "            # YOUR CODE HERE\n",
    "            self.mean[label] =  np.mean(x,axis = 0,keepdims = True)\n",
    "            \n",
    "            # Debugging check: raises AssertionError if the shape is not as expected\n",
    "            assert self.mean[label].shape == (1, trainX.shape[1])\n",
    "            \n",
    "            # Column-wise variance of x with keepdims = True, stored in self.var under the key label as a 2-D np.ndarray\n",
    "            # YOUR CODE HERE\n",
    "            self.var[label] = np.var(x,axis = 0,keepdims = True)\n",
    "            \n",
    "            # debug\n",
    "            assert self.var[label].shape == (1, trainX.shape[1])\n",
    "            \n",
    "            # Smoothing: the variance appears in a denominator, so add a small value to avoid division by zero\n",
    "            self.var[label] += 1e-9 * np.var(trainX, axis = 0).max()\n",
    "        \n",
    "    def predict(self, testX):\n",
    "        '''\n",
    "        Predict the class labels of the test samples using the log-transformed formula\n",
    "\n",
    "        Parameters\n",
    "        ----------\n",
    "            testX: np.ndarray, test features, shape: (n_test_samples, n_features)\n",
    "    \n",
    "        Returns\n",
    "        ----------\n",
    "            prediction: np.ndarray, predicted labels, shape: (n_test_samples, )\n",
    "        '''\n",
    "        \n",
    "        # Allocate results to hold each sample's score for each class, shape (n_test_samples, n_classes): one row per sample, one column per class\n",
    "        results = np.empty((testX.shape[0], len(self.probability_of_y)))\n",
    "        \n",
    "        # Allocate the list labels, which will hold the class labels in self.label_mapping order; it is filled inside the loop below\n",
    "        labels = [0] * len(self.probability_of_y)\n",
    "        \n",
    "        # Loop over the classes: label is the class label and index its column; each sample's log-score for label goes into column index of results\n",
    "        for label, index in self.label_mapping.items():\n",
    "            \n",
    "            # Prior probability of this class\n",
    "            py = self.probability_of_y[label]\n",
    "            \n",
    "            # Using the transformed formula, sum the log conditional probabilities over all features into sum_of_conditional_probability\n",
    "            # YOUR CODE HERE\n",
    "            sum_of_conditional_probability = - 0.5 * np.sum(((testX - self.mean[label]) ** 2) / self.var[label], axis = 1)\n",
    "            sum_of_conditional_probability += - 0.5 * np.sum(np.log(2 * np.pi * self.var[label]))\n",
    "            \n",
    "            # debug\n",
    "            assert sum_of_conditional_probability.shape == (len(testX), )\n",
    "            \n",
    "            # Add the log prior to the summed log conditional probabilities; result should have shape (n_test_samples, )\n",
    "            # YOUR CODE HERE\n",
    "            result = sum_of_conditional_probability + np.log(py)\n",
    "            \n",
    "            # debug\n",
    "            assert result.shape == (len(testX), )\n",
    "    \n",
    "            # Store every test sample's log-score for this class in results\n",
    "            results[:, index] = result\n",
    "            \n",
    "            # Place the current label at position index of labels\n",
    "            labels[index] = label\n",
    "        \n",
    "        # Convert labels to an np.ndarray\n",
    "        np_labels = np.array(labels)\n",
    "        \n",
    "        # After the loop, results holds the unnormalized log posterior of each test sample for each class: one row per sample, one column per class\n",
    "        # Find the column index of the maximum in each row, i.e. the most probable class index for each sample, and store it in max_prob_index\n",
    "        # YOUR CODE HERE\n",
    "        max_prob_index = np.argmax(results, axis = 1)\n",
    "        # debug\n",
    "        assert max_prob_index.shape == (len(testX), )\n",
    "        \n",
    "        # Turn each sample's best index into the corresponding label in np_labels\n",
    "        # This uses technique 5 (NumPy indexing) from above\n",
    "        # YOUR CODE HERE\n",
    "        prediction = np_labels[max_prob_index]\n",
    "        \n",
    "        # debug\n",
    "        assert prediction.shape == (len(testX), )\n",
    "        \n",
    "        # Return the predictions\n",
    "        return prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2760, 57)\n",
      "Accuracy: 0.8126018468223791\n",
      "Precision: 0.6829511465603191\n",
      "Recall: 0.9620786516853933\n",
      "F1: 0.7988338192419826\n"
     ]
    }
   ],
   "source": [
    "# Test case\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.metrics import precision_score\n",
    "from sklearn.metrics import recall_score\n",
    "from sklearn.metrics import f1_score\n",
    "model = myGaussianNB()\n",
    "model.fit(trainX, trainY)\n",
    "\n",
    "# Predict once and reuse the result for every metric\n",
    "prediction = model.predict(testX)\n",
    "print(\"Accuracy:\", accuracy_score(testY, prediction))  # 0.812\n",
    "print(\"Precision:\", precision_score(testY, prediction))\n",
    "print(\"Recall:\", recall_score(testY, prediction))\n",
    "print(\"F1:\", f1_score(testY, prediction))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 4. Compute the other metrics\n",
    "\n",
    "Precision|Recall|F1\n",
    "-|-|-\n",
    "0.683|0.962|0.799"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**The model achieves high recall (0.962), at the cost of lower precision (0.683).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
